{
 "metadata": {
  "name": "",
  "signature": "sha256:9ab802c5a0881222286c849fd84d0113517693cf0cc2602a7fafa8caebb0608f"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#Naive Bayes Classification\n",
      "2014-07-25\n",
      "\n",
      "Scott Hendrickson (@DrSkippy)"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Intuitively, some classification problems seem succeptible to a strategy based on the likelihood of our sample being in a given class based on probability of the presense of the features. \n",
      "\n",
      "For example, the likelihood of an email with the phrase \"Nigerian Prince\" being spam is much higher than an email with the phrase \"This is your monther. Why haven't you called?\"\n",
      "\n",
      "In terms of probabilities, we want to calculate $p(class | features)$ for each possible class and then choose the class with the max probability.\n",
      "\n",
      "In order to estimate the probabilities we need to calculate $p(class | features)$, we can use a the technique of labeling some data, then calculating the probabilites."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "###Bayes' Theorem"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "How to proceed? Given a labeled training set, it is straight forward to calculate $p(features|class)$. \n",
      "\n",
      "Flipping this over to get what we want looks like a job for Bayes' Theorem:\n",
      "\n",
      "$$p(x)p(y|x) = p(y)p(x|y)$$\n",
      "\n",
      "Use your algebra:\n",
      "\n",
      "$$p(features)p(class|features) = p(class)p(features|class)$$\n",
      "\n",
      "$$p(class|features) = \\frac{p(class)p(features|class)}{p(features)}$$"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "###Chain Rule"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "This seems pretty managable, but there is one tiny hitch. There are potentially many dimensions to each feature vector, ${f_0, f_1, \\ldots ,f_n}$. This means we have,\n",
      "\n",
      "$$p(features|class) = p({f_0, f_1, \\ldots ,f_n}|c)$$\n",
      "\n",
      "Think about only 2 dimentions for a moment. If the features relate to the class independently of each other, we can rewrite:\n",
      "\n",
      "$$p({f_0, f_1}|c) = p(f_1|c, f_0)p(f_0|c)$$\n",
      "\n",
      "But since the probablility of $f_1$ is independent of $f_0$, we can repeat this operation through the fulll set of features to collapse to,\n",
      "\n",
      "$$p({f_0, f_1}|c) = p(f_1|c)p(f_0|c)$$\n",
      "\n",
      "So in general we can write:\n",
      "\n",
      "$$p({f_0, f_1, \\ldots, f_n}|c) = \\Pi^n_i p(f_i|c)$$"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "###Naive Bayes Classification"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Start with a representative labeled training set. Calculate $p(c_i)$ for each class. Calculate $p(f_j|c_i)$ for all feature dimensions. Classify new samples by finding the most likely class given the features:\n",
      "\n",
      "$$p(class_i|{f_0, f_1, \\ldots, f_n})= \\frac{p(class_i)}{p({f_0, f_1, \\ldots, f_n})} \\Pi^n_j p(f_j|c_i)$$"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Problem 1: Classify even/odd binary numbers by the number of 0s and 1s.\n",
      "\n",
      "* How many dimensions in the feature vector?\n",
      "* Why might this work? \n",
      "* Why shouldn't it work?\n",
      "* How well do you think this will work?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "truth = [\"odd\", \"even\"]  # uses some human readiable text to make reading results easier\n",
      "training_data = [ \n",
      "        (\"{:05d}\".format(\n",
      "            int(str(bin(x)).replace(\"0b\",\"\")))\n",
      "            , truth[int(x%2 == 0)]) \n",
      "                for x in range(1,32)]\n",
      "training_data"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 1,
       "text": [
        "[('00001', 'odd'),\n",
        " ('00010', 'even'),\n",
        " ('00011', 'odd'),\n",
        " ('00100', 'even'),\n",
        " ('00101', 'odd'),\n",
        " ('00110', 'even'),\n",
        " ('00111', 'odd'),\n",
        " ('01000', 'even'),\n",
        " ('01001', 'odd'),\n",
        " ('01010', 'even'),\n",
        " ('01011', 'odd'),\n",
        " ('01100', 'even'),\n",
        " ('01101', 'odd'),\n",
        " ('01110', 'even'),\n",
        " ('01111', 'odd'),\n",
        " ('10000', 'even'),\n",
        " ('10001', 'odd'),\n",
        " ('10010', 'even'),\n",
        " ('10011', 'odd'),\n",
        " ('10100', 'even'),\n",
        " ('10101', 'odd'),\n",
        " ('10110', 'even'),\n",
        " ('10111', 'odd'),\n",
        " ('11000', 'even'),\n",
        " ('11001', 'odd'),\n",
        " ('11010', 'even'),\n",
        " ('11011', 'odd'),\n",
        " ('11100', 'even'),\n",
        " ('11101', 'odd'),\n",
        " ('11110', 'even'),\n",
        " ('11111', 'odd')]"
       ]
      }
     ],
     "prompt_number": 1
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from operator import itemgetter"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 2
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "n_even = len([itemgetter(0)(x) for x in training_data if itemgetter(1)(x) == \"even\"])\n",
      "n_odd = len([itemgetter(0)(x) for x in training_data if itemgetter(1)(x) == \"odd\"])\n",
      "p_even = float(n_even)/len(training_data)\n",
      "p_odd = float(n_odd)/len(training_data)\n",
      "print \"p(even) = {}  p(odd) = {}\".format(p_even, p_odd)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "p(even) = 0.483870967742  p(odd) = 0.516129032258\n"
       ]
      }
     ],
     "prompt_number": 4
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "n_zeros = sum([itemgetter(0)(x).count(\"0\") for x in training_data])\n",
      "n_ones = sum([itemgetter(0)(x).count(\"1\") for x in training_data])\n",
      "total_characters = n_zeros + n_ones\n",
      "p_zero = float(n_zeros)/total_characters\n",
      "p_one = float(n_ones)/total_characters\n",
      "print \"total characters : {}   0s : {}  1s : {}\".format(total_characters, p_zero, p_one)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "total characters : 155   0s : 0.483870967742  1s : 0.516129032258\n"
       ]
      }
     ],
     "prompt_number": 5
    },
    {
     "cell_type": "code",
     "collapsed": true,
     "input": [
      "n_zeros_even = sum([itemgetter(0)(x).count(\"0\") for x in training_data if itemgetter(1)(x) == \"even\"])\n",
      "n_ones_even = sum([itemgetter(0)(x).count(\"1\") for x in training_data if itemgetter(1)(x) == \"even\"])\n",
      "p_zero_given_even = float(n_zeros_even)/(n_zeros_even + n_ones_even)\n",
      "p_one_given_even = float(n_ones_even)/(n_zeros_even + n_ones_even)\n",
      "print (\"p(0|even) = {}   P(1|even) = {} \".format(p_zero_given_even,p_one_given_even))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "p(0|even) = 0.573333333333   P(1|even) = 0.426666666667 \n"
       ]
      }
     ],
     "prompt_number": 6
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "n_zeros_odd = sum([itemgetter(0)(x).count(\"0\") for x in training_data if itemgetter(1)(x) == \"odd\"])\n",
      "n_ones_odd = sum([itemgetter(0)(x).count(\"1\") for x in training_data if itemgetter(1)(x) == \"odd\"])\n",
      "p_zero_given_odd = float(n_zeros_odd)/(n_zeros_odd + n_ones_odd)\n",
      "p_one_given_odd = 1.0 - p_zero_given_odd\n",
      "print (\"p(0|odd) = {}   P(1|odd) = {} \".format(p_zero_given_odd,p_one_given_odd))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "p(0|odd) = 0.4   P(1|odd) = 0.6 \n"
       ]
      }
     ],
     "prompt_number": 7
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "###Simplification  \n",
      "\n",
      "$p({f_0, f_1, \\ldots, f_n}) = const.$ for all calcuations, so we can omit it from the classification calculation."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def p_odd_given(sample):\n",
      "    n_zeros = sample.count(\"0\")\n",
      "    n_ones = sample.count(\"1\")\n",
      "    return p_odd * (p_zero_given_odd**n_zeros) * (p_one_given_odd**n_ones)\n",
      "  \n",
      "def p_even_given(sample):\n",
      "    n_zeros = sample.count(\"0\")\n",
      "    n_ones = sample.count(\"1\")\n",
      "    return p_even * (p_zero_given_even**n_zeros) * (p_one_given_even**n_ones)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 8
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "correct = 0\n",
      "for s, t in training_data:\n",
      "    star = \"\"\n",
      "    if truth[int(p_odd_given(s) < p_even_given(s))] == t:\n",
      "        correct += 1\n",
      "        star = \"*\"\n",
      "    print \"Pattern: {} --> p(odd) = {:1.3f} p(even) = {:1.3f} predict = {:4s} truth = {:4s} {}\".format(\n",
      "            s\n",
      "            , p_odd_given(s)\n",
      "            , p_even_given(s)\n",
      "            , truth[int(p_odd_given(s) < p_even_given(s))]\n",
      "            , t\n",
      "            , star)\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Pattern: 00001 --> p(odd) = 0.008 p(even) = 0.022 predict = even truth = odd  \n",
        "Pattern: 00010 --> p(odd) = 0.008 p(even) = 0.022 predict = even truth = even *\n",
        "Pattern: 00011 --> p(odd) = 0.012 p(even) = 0.017 predict = even truth = odd  \n",
        "Pattern: 00100 --> p(odd) = 0.008 p(even) = 0.022 predict = even truth = even *\n",
        "Pattern: 00101 --> p(odd) = 0.012 p(even) = 0.017 predict = even truth = odd  \n",
        "Pattern: 00110 --> p(odd) = 0.012 p(even) = 0.017 predict = even truth = even *\n",
        "Pattern: 00111 --> p(odd) = 0.018 p(even) = 0.012 predict = odd  truth = odd  *\n",
        "Pattern: 01000 --> p(odd) = 0.008 p(even) = 0.022 predict = even truth = even *\n",
        "Pattern: 01001 --> p(odd) = 0.012 p(even) = 0.017 predict = even truth = odd  \n",
        "Pattern: 01010 --> p(odd) = 0.012 p(even) = 0.017 predict = even truth = even *\n",
        "Pattern: 01011 --> p(odd) = 0.018 p(even) = 0.012 predict = odd  truth = odd  *\n",
        "Pattern: 01100 --> p(odd) = 0.012 p(even) = 0.017 predict = even truth = even *\n",
        "Pattern: 01101 --> p(odd) = 0.018 p(even) = 0.012 predict = odd  truth = odd  *\n",
        "Pattern: 01110 --> p(odd) = 0.018 p(even) = 0.012 predict = odd  truth = even \n",
        "Pattern: 01111 --> p(odd) = 0.027 p(even) = 0.009 predict = odd  truth = odd  *\n",
        "Pattern: 10000 --> p(odd) = 0.008 p(even) = 0.022 predict = even truth = even *\n",
        "Pattern: 10001 --> p(odd) = 0.012 p(even) = 0.017 predict = even truth = odd  \n",
        "Pattern: 10010 --> p(odd) = 0.012 p(even) = 0.017 predict = even truth = even *\n",
        "Pattern: 10011 --> p(odd) = 0.018 p(even) = 0.012 predict = odd  truth = odd  *\n",
        "Pattern: 10100 --> p(odd) = 0.012 p(even) = 0.017 predict = even truth = even *\n",
        "Pattern: 10101 --> p(odd) = 0.018 p(even) = 0.012 predict = odd  truth = odd  *\n",
        "Pattern: 10110 --> p(odd) = 0.018 p(even) = 0.012 predict = odd  truth = even \n",
        "Pattern: 10111 --> p(odd) = 0.027 p(even) = 0.009 predict = odd  truth = odd  *\n",
        "Pattern: 11000 --> p(odd) = 0.012 p(even) = 0.017 predict = even truth = even *\n",
        "Pattern: 11001 --> p(odd) = 0.018 p(even) = 0.012 predict = odd  truth = odd  *\n",
        "Pattern: 11010 --> p(odd) = 0.018 p(even) = 0.012 predict = odd  truth = even \n",
        "Pattern: 11011 --> p(odd) = 0.027 p(even) = 0.009 predict = odd  truth = odd  *\n",
        "Pattern: 11100 --> p(odd) = 0.018 p(even) = 0.012 predict = odd  truth = even \n",
        "Pattern: 11101 --> p(odd) = 0.027 p(even) = 0.009 predict = odd  truth = odd  *\n",
        "Pattern: 11110 --> p(odd) = 0.027 p(even) = 0.009 predict = odd  truth = even \n",
        "Pattern: 11111 --> p(odd) = 0.040 p(even) = 0.007 predict = odd  truth = odd  *\n"
       ]
      }
     ],
     "prompt_number": 9
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print \"=\"*50\n",
      "baseline_correct = sum([1 for x in training_data if itemgetter(1)(x) == \"odd\"]) # if we guessed all odd\n",
      "print \"Baseline (guess all \\\"odd\\\")\"\n",
      "print \"Total patterns: {}  Total correct: {} ({:2.2%})\".format(\n",
      "        len(training_data)\n",
      "        , baseline_correct\n",
      "        , float(baseline_correct)/len(training_data))\n",
      "print \"=\"*50\n",
      "print \"Naive Bayes Classifier\"\n",
      "print \"Total patterns: {}  Total correct: {} ({:2.2%})\".format(\n",
      "        len(training_data)\n",
      "        , correct\n",
      "        , float(correct)/len(training_data))\n",
      "print \"=\"*50"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "==================================================\n",
        "Baseline (guess all \"odd\")\n",
        "Total patterns: 31  Total correct: 16 (51.61%)\n",
        "==================================================\n",
        "Naive Bayes Classifier\n",
        "Total patterns: 31  Total correct: 21 (67.74%)\n",
        "==================================================\n"
       ]
      }
     ],
     "prompt_number": 10
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "###Thoughts: \n",
      "\n",
      "* Comparing to best naive baseline might not be the same as comparing to random.\n",
      "* Comparing to best naive baseline might reveal pathologies of your measurement/cost assumptions or poorly balanced training set.\n",
      "* Is it crazy to not use ideas of mechanism (i.e. do a 1 bit check on the LST of the binary number!)?\n",
      "* How should we think about choosing features for machine learning when we have only tenuous ideas of mechanism?"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Problem 2: Classify Tweets by Topics"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Make two sets of tweets with simple filters, don't work too much about what is in them.\n",
      "\n",
      "    ./search_api.py -f\"lang:en (dog OR cat)\" json | gnacs.py | cut -d\"|\" -f3  | sort -u | head -qn85 > pets.txt\n",
      "    ./search_api.py -f\"lang:en (hat OR coat)\" json | gnacs.py | cut -d\"|\" -f3  | sort -u | head -qn85 > hats.txt"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "corpus = open(\"./pets.txt\",\"rb\").readlines()\n",
      "n_pets = len(corpus)\n",
      "corpus.extend(open(\"./hats.txt\",\"rb\").readlines())\n",
      "n_hats = len(corpus) - n_pets\n",
      "Y = [0]*n_pets + [1]*n_hats\n",
      "print \"corpus size = {}   n pets = {}   n hats = {}\".format(len(corpus), n_pets, n_hats)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "corpus size = 170   n pets = 85   n hats = 85\n"
       ]
      }
     ],
     "prompt_number": 11
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print corpus[2]\n",
      "print corpus[12]\n",
      "print corpus[20]\n",
      "print corpus[130]\n",
      "print corpus[150]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "@ATLTurnUpRadio can I send u some music big dog\n",
        "\n",
        "**DARLING KITTEN DUO OF DELIGHT NEEDS A LOVING CAT PARENT - PLEASE GRANT PANDA &amp; LAYLA A DEATH ROW PARDON!!!***... http://t.co/RhriqIFz1Q\n",
        "\n",
        "How chubby is this cat btw! Hahaha http://t.co/fwziGFOlce via\n",
        "\n",
        "RT @hat_films: I liked a @YouTube video http://t.co/wyVzJByRLd BAFTA Young Game Designers: Winners Ceremony\n",
        "\n",
        "RT @XFilesbutemoji: \u0111\u009f\u008e\u0093 \u0111\u009f\u0091\u015ascully guess what!  \u0111\u009f\u0091\u0160you finally got your high school diploma \u0111\u009f\u008e\u0093 \u0111\u009f\u0091\u015ano, i found this funny hat \u0111\u009f\u008e\u0093 \u0111\u009f\u0091\u015ai'm a wizard!\n",
        "\n"
       ]
      }
     ],
     "prompt_number": 12
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "### Scikit Learn"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Like with all of the other other fundamental algorithm RSTs, don't use the code indended to demonstrate how the algorithm works. Instead, use Scikit Learn.  \n",
      "\n",
      "In this case, we take a big shortcut by using the term frequency inverse document frequency (tfidf) vectorizer. Recall, this weights each term by the number of appeances, but decreases the weights as it appears in too many documents to be informative.  Scikit Learn uses sparse matricies for space efficiency."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn.feature_extraction.text import TfidfVectorizer\n",
      "vec = TfidfVectorizer(min_df=2)\n",
      "X = vec.fit_transform(corpus)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 16
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print corpus[101]\n",
      "print X[101]\n",
      "print corpus[11]\n",
      "print X[11]\n",
      "print corpus[121]\n",
      "print X[121]\n",
      "print corpus[131]\n",
      "print X[131]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Im gonna eat my hat\n",
        "\n",
        "  (0, 141)\t0.331415548236\n",
        "  (0, 98)\t0.642536490438\n",
        "  (0, 84)\t0.253884720489\n",
        "  (0, 53)\t0.642536490438\n",
        "**DARLING KITTEN DUO OF DELIGHT NEEDS A LOVING CAT PARENT - PLEASE GRANT PANDA &amp; LAYLA A DEATH ROW PARDON!!!***... http://t.co/DVpyGOWrus\n",
        "\n",
        "  (0, 191)\t0.256294694664\n",
        "  (0, 177)\t0.230333844711\n",
        "  (0, 169)\t0.256294694664\n",
        "  (0, 168)\t0.256294694664\n",
        "  (0, 167)\t0.256294694664\n",
        "  (0, 154)\t0.155036695572\n",
        "  (0, 145)\t0.241674302167\n",
        "  (0, 126)\t0.256294694664\n",
        "  (0, 114)\t0.256294694664\n",
        "  (0, 110)\t0.256294694664\n",
        "  (0, 95)\t0.0947816953278\n",
        "  (0, 79)\t0.256294694664\n",
        "  (0, 52)\t0.256294694664\n",
        "  (0, 45)\t0.256294694664\n",
        "  (0, 44)\t0.256294694664\n",
        "  (0, 42)\t0.256294694664\n",
        "  (0, 36)\t0.0969446325662\n",
        "  (0, 34)\t0.152777596803\n",
        "  (0, 7)\t0.185841337931\n",
        "@Radiant_Nea Oh. I put a pen spring on mine where it plugs into my phone. I've also seen ppl coat that part in super glue to reinforce it...\n",
        "\n",
        "  (0, 252)\t0.298884791373\n",
        "  (0, 239)\t0.257804274945\n",
        "  (0, 229)\t0.134562725663\n",
        "  (0, 216)\t0.199673803711\n",
        "  (0, 210)\t0.298884791373\n",
        "  (0, 197)\t0.268609864175\n",
        "  (0, 182)\t0.281834836566\n",
        "  (0, 174)\t0.298884791373\n",
        "  (0, 171)\t0.298884791373\n",
        "  (0, 158)\t0.170899378363\n",
        "  (0, 156)\t0.268609864175\n",
        "  (0, 141)\t0.145368314893\n",
        "  (0, 104)\t0.337325270555\n",
        "  (0, 100)\t0.257804274945\n",
        "  (0, 99)\t0.149988834856\n",
        "  (0, 37)\t0.203498786128\n",
        "RT @Jumpman23: No matter what hat you wear, tip it to The Captain. #RE2PECT https://t.co/lHFpcyYK86\n",
        "\n",
        "  (0, 260)\t0.207663974331\n",
        "  (0, 250)\t0.294436331325\n",
        "  (0, 247)\t0.364710821436\n",
        "  (0, 229)\t0.174132065466\n",
        "  (0, 228)\t0.386774463741\n",
        "  (0, 217)\t0.159618238609\n",
        "  (0, 192)\t0.162309551008\n",
        "  (0, 185)\t0.386774463741\n",
        "  (0, 149)\t0.364710821436\n",
        "  (0, 104)\t0.218259350076\n",
        "  (0, 96)\t0.364710821436\n",
        "  (0, 84)\t0.144107776504\n",
        "  (0, 36)\t0.14629919797\n"
       ]
      }
     ],
     "prompt_number": 17
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from sklearn.naive_bayes import MultinomialNB\n",
      "clf = MultinomialNB()\n",
      "clf.fit(X, Y)\n",
      "print(clf.predict(X[12]))\n",
      "print(clf.predict(X[22]))\n",
      "print(clf.predict(X[32]))\n",
      "print(clf.predict(X[120]))\n",
      "print(clf.predict(X[130]))\n",
      "print(clf.predict(X[140]))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[0]\n",
        "[0]\n",
        "[0]\n",
        "[1]\n",
        "[1]\n",
        "[1]\n"
       ]
      }
     ],
     "prompt_number": 20
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "correct = 0\n",
      "for i, doc in enumerate(corpus):\n",
      "    if Y[i] == clf.predict(X[i]):\n",
      "        correct += 1\n",
      "    else:\n",
      "        print i, Y[i], clf.predict(X[i]), doc\n",
      "print \"correct = {} ({:%})\".format(correct, float(correct)/len(corpus))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "correct = 170 (100.000000%)\n"
       ]
      }
     ],
     "prompt_number": 21
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##How does it generalize?\n",
      "\n",
      "* Where does the opportunity for this model to generalize come from? Compare this to bias-variance tradeoff in Logistic Regression.\n",
      "* How well do you think this model generalizes?"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": true,
     "input": [
      "print \"PETS\"\n",
      "print clf.predict(vec.transform([\"My dog has fleas\"]))  # seems clear\n",
      "print clf.predict(vec.transform([\"Grab your cat and moat\"])) # also clear\n",
      "print clf.predict(vec.transform([\"This pet is a hamster\"])) # not in the original word list, but pet nonetheless\n",
      "print clf.predict(vec.transform([\"socks can be the name of a pet\"])) # pet\n",
      "print clf.predict(vec.transform([\"But hats is not usually the name of a pet\"])) # pet\n",
      "print \"OUTERWARE\"\n",
      "print clf.predict(vec.transform([\"Grab your hat and coat\"])) # obvious\n",
      "print clf.predict(vec.transform([\"My socks and shoes are dirty\"])) # obvious but too far?\n",
      "print clf.predict(vec.transform([\"These come in hard, tinfoil, thinking and mortar boards\"])) # tricky\n",
      "print clf.predict(vec.transform([\"Redhat's logo is a fedora\"])) # ok\n",
      "print clf.predict(vec.transform([\"The cat in the hat\"])) # mixed, confound it!\n",
      "print clf.predict(vec.transform([\"Hiding a haircut under there?\"])) # infered?\n",
      "print clf.predict(vec.transform([\"Mackinaw or Toque?\"])) # Too far?"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "PETS\n",
        "[0]\n",
        "[0]\n",
        "[0]\n",
        "[1]\n",
        "[1]\n",
        "OUTERWARE\n",
        "[1]\n",
        "[1]\n",
        "[1]\n",
        "[1]\n",
        "[1]\n",
        "[0]\n",
        "[0]\n"
       ]
      }
     ],
     "prompt_number": 22
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}